Abstract

We investigate an extension of continuous online learning in recurrent
neural network language models. The model keeps a separate vector
representation of the current unit of text being processed and
adaptively adjusts it after each prediction. The initial experiments
give promising results, indicating that the method is able to increase
language modelling accuracy, while also decreasing the parameters
needed to store the model along with the computation required at each
step.

1 Introduction 

In recent years, neural network models have shown impressive
performance on many natural language processing tasks, such as speech
recognition (Chorowski et al., 2014; Graves et al., 2013), machine
translation (Kalchbrenner and Blunsom, 2013; Cho et al., 2014), text
classification (Le and Mikolov, 2014; Kalchbrenner et al., 2014) and
image description generation (Kiros et al., 2014). One of the main
advantages of these methods is the ability to learn smooth vector
representations for words, thereby reducing the sparsity problem
inherent in any natural language dataset.

Language modelling is another task where neural networks have
delivered excellent results (Bengio et al., 2003; Mikolov et al.,
2011). Chelba et al. (2014) have recently benchmarked several
well-known language models by training on very large datasets. They
found that a recurrent neural network language model (RNNLM) combined
with a 9-gram MaxEnt model was able to give the best results and
lowest perplexity.

In this work we investigate a possible extension of RNNLM, by allowing
it to continue learning and adapting during testing. The model keeps a
vector representation of the current sentence that is being processed,
and continuously modifies it based on an error signal. We refer to
this as a version of online learning, as gradient descent is used to
optimise the vector even during testing.

The technique is inspired by work on representation learning
(Collobert and Weston, 2008; Mnih and Hinton, 2008; Mikolov et al.,
2013), especially Le and Mikolov (2014) who use a related model to
learn representations for text classification. We extend the idea to
recurrent models and apply it to the task of language modelling. Our
results indicate that by exchanging some existing model parameters for
a component using online learning, the system is able to achieve lower
perplexity while also reducing the necessary computation.

2 RNNLM 

We base our implementation of the RNNLM on Mikolov et al. (2011),
shown in Figure 1. The input layer to the network consists of a 1-hot
vector representing the previous word in the sequence, and the hidden
vector from the previous time step. These are multiplied by
corresponding weight matrices and the resulting vectors are passed
through an activation function to calculate the hidden vector at the
current time step.^1

Figure 1: Recurrent neural network language model (RNNLM)

A class-based output architecture is used to avoid calculating the
softmax over all words in the vocabulary. The probability
distributions over words and classes are calculated by multiplying the
hidden vector with the corresponding weight matrix and applying the
softmax function:

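$$\mathrm{hidden}_t = \sigma(E \cdot \mathrm{input}_t + W_h \cdot \mathrm{hidden}_{t-1})$$
$$\mathrm{classes} = \mathrm{softmax}(W_c \cdot \mathrm{hidden}_t)$$
$$\mathrm{output} = \mathrm{softmax}(W_o^{(c)} \cdot \mathrm{hidden}_t)$$

where $\sigma$ is the logistic function and $W_o^{(c)}$ is the weight
matrix between the hidden layer and the output words in class c.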

Finally, we multiply the probability of the next word belonging to
class c with the output probability of the next word given the class
to get the overall probability of the next word given the previous
words:

$$P(w_{t+1} \mid w_1^t) \approx \mathrm{classes}_c \cdot \mathrm{output}_{w_{t+1}}$$

Negative log-probability is used as the loss function, which optimises
the network to assign a high probability to the correct words. The
network is trained using gradient descent and backpropagation through
time. In the basic model, this means unrolling the recurrent network
for a fixed number of time steps, essentially turning it into a deep
feedforward network which outputs probability distributions on
different layers. Instead of using a fixed number of steps, our
implementation unrolls each sentence from the last word to the first
word, making it more suitable for processing individual sentences as
opposed to longer texts.

In addition, we introduce a special vector to use as the hidden vector
at the start of each sentence. The values in this vector are treated
as parameters and optimised during training. This allows the network
to learn a suitable starting point when no other information is
available, giving slight performance improvements in our experiments.
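The forward computation described in this section can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the experiments: the function names, the row-per-word storage of E and W_o, and the toy weight values are all invented for the example. It shows the hidden-state update, the class-factorised next-word probability, and a sentence loss that starts from a learned start vector.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def hidden_step(word_id, hidden_prev, E, W_h):
    # With a 1-hot input, E . input_t is a lookup; E holds one row per word here.
    recurrent = matvec(W_h, hidden_prev)
    return [sigmoid(e + r) for e, r in zip(E[word_id], recurrent)]

def next_word_prob(next_word, hidden, W_c, W_o, word_class, class_members):
    # P(word) = P(class) * P(word | class); the word softmax covers one class only.
    c = word_class[next_word]
    class_probs = softmax(matvec(W_c, hidden))
    members = class_members[c]
    in_class = softmax(matvec([W_o[w] for w in members], hidden))
    return class_probs[c] * in_class[members.index(next_word)]

def sentence_nll(word_ids, hidden_start, E, W_h, W_c, W_o, word_class, class_members):
    # Negative log-probability of word_ids[1:], starting from the start vector.
    hidden, nll = list(hidden_start), 0.0
    for prev_word, next_word in zip(word_ids, word_ids[1:]):
        hidden = hidden_step(prev_word, hidden, E, W_h)
        nll -= math.log(next_word_prob(next_word, hidden, W_c, W_o,
                                       word_class, class_members))
    return nll

# Toy model: 4 words, 2 classes, 2 hidden units (all values are made up).
E = [[0.1, -0.2], [0.0, 0.3], [0.5, 0.1], [-0.4, 0.2]]
W_h = [[0.2, -0.1], [0.0, 0.3]]
W_c = [[0.3, -0.3], [-0.2, 0.4]]
W_o = [[0.1, 0.2], [-0.1, 0.0], [0.4, -0.2], [0.0, 0.1]]
word_class = [0, 0, 1, 1]
class_members = [[0, 1], [2, 3]]
hidden_start = [0.5, 0.5]  # would be a trained parameter in the real model

hidden = hidden_step(2, hidden_start, E, W_h)
probs = [next_word_prob(w, hidden, W_c, W_o, word_class, class_members)
         for w in range(4)]
nll = sentence_nll([0, 2, 1, 3], hidden_start, E, W_h, W_c, W_o,
                   word_class, class_members)
print(sum(probs), nll)
```

Because each word belongs to exactly one class, the factorised probabilities still sum to one over the vocabulary, while each prediction only requires a softmax over the classes and over the words of a single class.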

3 RNNLM with online learning 

We extend the RNNLM by introducing an additional document/context
vector, shown as doc in Figure 2. This vector will represent the
current document being processed, whether that is a sentence,
paragraph or a larger text. When calculating output probabilities over
classes and words, we also condition them on this new document vector:

$$\mathrm{classes} = \mathrm{softmax}(W_c \cdot \mathrm{hidden}_t + W_{dc} \cdot \mathrm{doc})$$
$$\mathrm{output} = \mathrm{softmax}(W_o^{(c)} \cdot \mathrm{hidden}_t + W_{do}^{(c)} \cdot \mathrm{doc})$$

where $W_{dc}$ is the weight matrix between the document vector and
class layer, and $W_{do}^{(c)}$ is the weight matrix between the
document vector and output words in class c.

Figure 2: RNNLM with an additional document vector for active learning

^1 Explicit multiplication for the word vectors can be avoided by
using data structures that retrieve the correct vector in constant
time.

We construct the document vector by treating the values as parameters
and optimising them during both training and testing using
backpropagation. At each time step, the system first performs a
forward pass through the network and outputs probability distributions
over classes and words. We then use the next word in the sequence to
calculate the error derivatives in the output and backpropagate them
back into the document vector. The update is not able to affect the
output at the current time step, but it will modify the document
vector which will be used in the next time step. The same word that is
used for modifying the document vector for the next time step is also
available in the input layer of the next time step, therefore the
system receives no additional knowledge as input.

We are interested in modelling individual sentences, therefore at the
beginning of each sentence the document vector is reset to a specific
starting state, which is optimised during training and shared between
all sentences. During testing, the values in the document vector are
continuously modified depending on the error derivatives being
backpropagated from the output layer, while all other parameters in
the model stay constant.
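The test-time procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the system used in the experiments: it uses a single softmax over a three-word toy vocabulary instead of the class-based output, it takes the hidden vectors as given rather than computing them with the recurrent layer, and the names `output_probs`, `update_doc`, `process_sentence`, `W_o` and `W_do` are hypothetical. Only the document vector changes; every weight matrix stays fixed.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def output_probs(hidden, doc, W_o, W_do):
    # softmax(W_o . hidden + W_do . doc), one row of each matrix per word
    scores = [
        sum(w * h for w, h in zip(row_h, hidden))
        + sum(w * d for w, d in zip(row_d, doc))
        for row_h, row_d in zip(W_o, W_do)
    ]
    return softmax(scores)

def update_doc(doc, probs, target, W_do, lr=0.1):
    # Gradient of -log probs[target] w.r.t. the scores is (probs - one_hot).
    errors = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    # Backpropagate the errors through W_do into the document vector only.
    grad = [sum(errors[k] * W_do[k][j] for k in range(len(W_do)))
            for j in range(len(doc))]
    return [d - lr * g for d, g in zip(doc, grad)]

def process_sentence(hiddens, targets, doc_start, W_o, W_do, lr=0.1):
    doc = list(doc_start)  # reset to the shared start state for each sentence
    log_prob = 0.0
    for hidden, target in zip(hiddens, targets):
        probs = output_probs(hidden, doc, W_o, W_do)  # predict before updating
        log_prob += math.log(probs[target])
        doc = update_doc(doc, probs, target, W_do, lr)  # affects later steps only
    return doc, log_prob

# Toy values (made up): 3 words, 2 hidden units, 2 document-vector units.
W_o = [[0.2, -0.1], [0.0, 0.3], [-0.3, 0.1]]
W_do = [[0.5, -0.2], [-0.1, 0.4], [0.3, 0.3]]
doc_start = [0.0, 0.0]
hidden = [0.5, 0.5]

probs_before = output_probs(hidden, doc_start, W_o, W_do)
doc_after = update_doc(doc_start, probs_before, 0, W_do)
probs_after = output_probs(hidden, doc_after, W_o, W_do)
final_doc, log_prob = process_sentence([hidden, hidden], [0, 1],
                                       doc_start, W_o, W_do)
print(probs_before[0], probs_after[0], log_prob)
```

In this toy setting, a single update towards word 0 raises the probability the model would assign to word 0 if the same hidden state were seen again, which is the effect the document vector relies on within a sentence.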



When dealing with larger texts and domain-specific corpora, similar
ideas of iterative learning can be applied to any language model.
After processing a certain amount of data during testing, a new model
could be trained using the previously seen testing examples as
additional training data. Since this process adds more training data
which is likely to be similar to upcoming testing examples, the system
is likely to achieve a better performance.

However, when dealing with independent sentences, online learning
becomes more difficult to apply. Each sentence contains very little
additional data, and even if the language model is adjusted after
every individual word, it only obtains evidence of previous words in
the sentence, whereas these words are relatively unlikely to occur
again in the same sentence. Therefore, instead of adjusting individual
word representations, our approach learns a distributed document
vector to represent the specific unit of text that is currently being
processed. This vector is then used as additional evidence when
calculating output probabilities.

Le and Mikolov (2014) use a similar method for learning vector
representations of documents and paragraphs. They construct a
feedforward language model and include a paragraph vector as an
additional vector in the input layer. The model parameters are trained
on the training set, and when given unseen test data, the system
optimises the paragraph vector according to the error signal. They use
these vectors as input to a logistic regression classifier and achieve
state-of-the-art performance on sentiment classification of movie
reviews. However, they did not consider the effect of this model
modification directly on the task of language modelling.

While the system of Le and Mikolov (2014) uses a basic feedforward
language model, we extend the idea to recurrent neural network
language models, as they are currently used in state-of-the-art
language modelling systems (Chelba et al., 2014). Attaching the
document vector to the input layer is not preferable for RNNLM, as the
error is only backpropagated into the input layer after several time
steps. When this time step is reached and the network is unrolled to
perform backpropagation through time, several words have already
passed without receiving any additional information. Since our
implementation performs the unrolling only at the end of each
sentence, the updates would not have any effect. Therefore, we attach
the document vector directly to the output layer, in parallel with the
recurrent hidden component. Parameters in the document vector can then
be updated at each time step, while the unrolling and backpropagation
through time still happens at the end of the sentence.

4 Experiments 

We constructed a dataset from English Wikipedia to evaluate language
modelling performance over individual sentences. The text was
tokenised, sentence split and lowercased. The sentences were shuffled,
in order to minimise any transfer effects between consecutive
sentences, and then split into training, development and test sets.
The final sentences were sampled randomly, in order to obtain
reasonable training times for the experiments. The dataset sizes are
shown in Table 2.



             Train        Dev        Test
Words        9,990,782    237,037    4,208,847
Sentences    419,278      10,000     176,564

Table 2: Dataset sizes


Model performance is measured using perplexity, therefore lower values
indicate a model which is able to better predict the data. Special
tokens are used to mark the beginning and end of a sentence. The
sentence end token is also included in the evaluation, whereas the
sentence start token is only used as context in the input layer. Any
words that occur fewer than 30 times in the training data were
replaced by a special token for unknown words, leaving a vocabulary of
16,514 unique words. The general learning rate was set to 0.1 and
decreased during training, whereas the learning rate of the document
vector was fixed at 0.1 for both training and testing.
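The preprocessing and evaluation steps above can be illustrated with a short sketch. It is written under assumptions that the text does not spell out: the unknown-word token is called `<unk>` here, perplexity is computed as the exponential of the average negative natural-log probability per evaluated token, and the helper names are hypothetical.

```python
import math
from collections import Counter

def build_vocab(train_sentences, min_count=30):
    # Keep words that occur at least min_count times in the training data.
    counts = Counter(word for sentence in train_sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

def map_unknown(sentence, vocab, unk="<unk>"):
    # Replace out-of-vocabulary words with the special unknown-word token.
    return [word if word in vocab else unk for word in sentence]

def perplexity(token_probs):
    # exp of the mean negative log-probability over the evaluated tokens
    # (sentence end tokens included, sentence start tokens excluded).
    total_nll = -sum(math.log(p) for p in token_probs)
    return math.exp(total_nll / len(token_probs))

# Toy usage with a threshold of 2 instead of 30.
train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = build_vocab(train, min_count=2)
mapped = map_unknown(["the", "bird", "sat"], vocab)
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
print(mapped, uniform_ppl)
```

A model that assigns probability 0.25 to every token has a perplexity of about 4, matching the intuition that perplexity measures the effective number of equally likely choices at each step.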

As the baseline, we use the regular RNNLM with 100-dimensional hidden
layers and word vectors (M = 100). In the experiments we increase the
capacity of the model and measure how that affects the perplexity on
the datasets. First, we increase the value of M, allowing more
information to be stored into word representations, while also
increasing the number of hidden-hidden and hidden-output connections.
As can be seen in Table 1, this improves the overall performance of
the model: setting M to 120 and 135 leads to progressively lower
perplexity.

                  Train PPL   Dev PPL   Test PPL   +Parameters   +Operations
Baseline M=100    92.65       103.56    102.51     -             -
M=120             88.60       98.78     97.79      666,960       7,400
M=100, D=20       87.28       95.36     94.39      332,300       6,000
M=135             85.17       96.33     95.71      1,167,705     13,475
M=100, D=35       80.11       91.05     90.29      581,525       10,500

Table 1: Perplexity and additional parameters/operations for different language model configurations

Next, instead of increasing M, we add a D-dimensional document vector
to the model and use this for online learning. When the same number of
elements is added to M or D, our results show consistently better
performance when using the document vector. Increasing M by 35 gives a
test perplexity of 95.71, whereas using a 35-dimensional document
vector gives 90.29. We also performed the same experiment using only
half of the training data, and the difference was even larger: 105.50
and 98.23 respectively.

One reason why online learning during model deployment is not commonly
used is because it is computationally expensive. Continuously
retraining the model and adjusting parameters can be very
time-consuming compared to a simple feedforward process through the
network. However, extra computation is also needed when using a hidden
vector of size M, as opposed to using a smaller value. When increasing
the value of M to M + X, the RNNLM will contain

$$X \cdot C + 2 \cdot X \cdot V + 2 \cdot X \cdot M + X^2$$

additional parameters and needs to perform

$$2 \cdot X \cdot M + X^2 + X \cdot C + X \cdot E[O]$$

additional operations at each time step.^2 C is the number of classes,
V is the vocabulary size, and E[O] is the expected number of words
that need to be processed in the output layer during one step.

The corresponding number of additional parameters in an RNNLM using a
D-dimensional document vector for online learning is

$$D + D \cdot V + D \cdot C$$

and the number of additional operations is

$$2 \cdot D \cdot E[O] + 2 \cdot D \cdot C$$

which includes the error backpropagation at each time step. For our
experiments V = 16,514, C = 100 and E[O] ≈ 50. Table 1 contains the
additional values for the experiments, showing that replacing some
hidden vector parameters with the actively learned document vector
leads to fewer total parameters and fewer operations, along with lower
perplexity.

^2 We only count the matrix multiplication operations, as they take
the majority of the time in a neural network language model.
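As a check on the arithmetic, the counts reported in Table 1 can be reproduced directly from the four expressions above. The sketch below uses hypothetical function names and treats the expected number of output words as exactly 50, which is an assumption (the text gives it only approximately), although it reproduces the reported operation counts.

```python
def extra_params_hidden(x, m, v, c):
    # Additional parameters when the hidden size grows from M to M + X.
    return x * c + 2 * x * v + 2 * x * m + x ** 2

def extra_ops_hidden(x, m, c, e_o):
    # Additional multiplications per time step when M grows to M + X.
    return 2 * x * m + x ** 2 + x * c + x * e_o

def extra_params_doc(d, v, c):
    # Additional parameters for a D-dimensional document vector.
    return d + d * v + d * c

def extra_ops_doc(d, c, e_o):
    # Additional operations per time step, including backpropagation.
    return 2 * d * e_o + 2 * d * c

V, C, E_O, M = 16514, 100, 50, 100  # E_O = 50 is assumed to be exact here

for size in (20, 35):
    print(
        size,
        extra_params_hidden(size, M, V, C),
        extra_ops_hidden(size, M, C, E_O),
        extra_params_doc(size, V, C),
        extra_ops_doc(size, C, E_O),
    )
```

For both sizes the document vector needs roughly half as many additional parameters as the equivalent increase in M, mainly because it avoids the second X·V term and the hidden-hidden connections.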

Figure 3 presents the relationship between perplexity and the number
of additional parameters, when increasing either M or D. The results
are averaged over 10 runs with different random initialisations. As
can be seen, using a small document vector lowers the perplexity with
fewer parameters, compared to simply increasing the main components of
the network. The graph of perplexity with respect to additional
operations in the model also has a very similar shape.



Figure 3: Perplexity as a function of additional parameters when
increasing either M or D. The x-axis shows the number of additional
parameters in the model, with respect to the baseline of M = 100,
D = 0. The y-axis shows the perplexity on the test set.






















Both Hufnagel and Marston also joined the long-standing technical death metal band Gorguts.

1. The band eventually went on to become the post-hardcore band Adair.

2. The band members originally came from different death metal bands, bonding over a common
interest in d-beat.

3. The proceeds went towards a home studio, which enabled him to concentrate on his solo output and
songs that were to become his debut mini-album "Feeding The Wolves".

The Chiefs reclaimed the title on September 29, 2014 in a Monday Night Football game against the New
England Patriots, hitting 142.2 decibels.

1. He played in twenty-four regular season games for the Colts, all off the bench. 

2. In May 2009 the Warriors announced they had re-signed him until the end of the 2011 season. 

3. The team played inconsistently throughout the campaign from the outset, losing the opening two 
matches before winning four consecutive games during September 1927. 

He was educated at Llandovery College and Jesus College, Oxford, where he obtained an M.A. degree. 

1. He studied at the Orthodox High School, then at the Faculty of Mathematics. 

2. Kaigama studied for the priesthood at St. Augustine’s Seminary in Jos with further study in theology 
in Rome. 

3. Under his stewardship, Zahira College became one of the leading schools in the country. 


Table 3: Examples of using the document vectors to find similar sentences in the development data. 


In order to further explore the relationship between D and M, we
trained a number of smaller models with different values, under the
constraint D + M = 100. To reduce computation time, only half of the
training data was used in these experiments. The lowest perplexity was
achieved in the region of D = 23 and M = 77, and making the document
vectors much smaller or larger led to a decrease in performance. This
indicates that including the document vector does help increase model
accuracy, but as it contains no information about the training data,
this vector should be small compared to the main model.

Intuitively, this approach works by having the document vector capture
the unique aspects of each sentence. While the general RNNLM is a
smooth static representation of the entire training data, the document
vector is optimised to represent how each sentence differs from the
main language model. Therefore we performed a qualitative evaluation
and found that the learned sentence vectors were also very good
predictors of semantic similarity. The RNN language model was trained
on the training set, and then used to process the development set. The
last state of the document vector of each sentence was used to
calculate cosine similarity. Table 3 contains randomly sampled
sentences from the development set, together with corresponding
development sentences that have the highest similarity (excluding the
original sentence). Even though there is almost no word overlap, the
retrieved sentences are semantically very similar.
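The retrieval step behind Table 3 can be sketched as a nearest-neighbour search under cosine similarity. This is a minimal illustration with hypothetical names and made-up two-dimensional vectors; in the experiments each vector would be the final document-vector state of one development sentence.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def most_similar(query_idx, vectors, k=3):
    # Rank every other sentence by cosine similarity to the query sentence.
    scored = [(cosine(vectors[query_idx], vec), idx)
              for idx, vec in enumerate(vectors) if idx != query_idx]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]

# Made-up final document vectors for four sentences.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbours = most_similar(0, vectors, k=2)
print(neighbours)
```

Cosine similarity ignores vector length, so two sentences are matched on the direction in which their document vectors moved away from the shared start state rather than on how far they moved.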

5 Conclusion 

We have described a possible extension of RNNLM which uses continuous
online learning. The model includes a separate vector to represent the
unit of text, such as a sentence, being currently processed. The
vector starts in a default state and is continuously updated using
backpropagation, leading to a more informative representation. The
modified language model achieves lower perplexity with a more optimal
use of parameters.

The idea of continuous training and adaptation is natural and also
established in biological learning processes, yet it is not widely
used due to computational complexity. Our experiments indicate that by
including this active learning component in the neural network model,
the system is able to achieve higher accuracy, while also decreasing
the parameters needed to store the model and decreasing the
computation required.